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Abstract — This work explores fundamental modeling and 
algorithmic issues arising in the well-established MapReduce 
framework. First, we formally specify a computational model 
for MapReduce which captures the functional flavor of the 
paradigm by allowing for a flexible use of parallelism. Indeed, 
the model diverges from a traditional processor-centric view 
by featuring parameters which embody only global and local 
memory constraints, thus favoring a more data-centric view. 
Second, we apply the model to the fundamental computation 
task of matrix multiplication presenting upper and lower 
bounds for both dense and sparse matrix multiplication, which 
highlight interesting tradeoffs between space and round com- 
plexity. Finally, building on the matrix multiplication results, 
we derive further space-round tradeoffs on matrix inversion 
and matching. 

^Teyvforrfi-Algorithms for Distributed Computing; Algo- 
rithms for High Performance Computing; Parallel Algorithms; 
Parallel Complexity Theory. 

I. Introduction 

In recent years, MapReduce has emerged as a compu- 
tational paradigm for processing large-scale data sets in a 
series of rounds executed on conglomerates of commodity 
servers H], and has been widely adopted by a number 
of large Web companies (e.g., Google, Yahoo!, Amazon) 
and in several other applications (e.g., GPU and multicore 
processing). (See 01 and references therein.) 

Informally, a MapReduce computation transforms an in- 
put set of key-value pairs into an output set of key-value 
pairs in a number of rounds, where in each round each pair 
is first individually transformed into a (possibly empty) set 
of new pairs (map step) and then all values associated with 
the same key are processed, separately for each key, by an 
instance of the same reduce function (simply called reducer 
in the rest of the paper) thus producing the next new set 
of key-value pairs (reduce step). In fact, as already noticed 
in fSl, a reduce step can clearly embed the subsequent map 
step so that a MapReduce computation can be simply seen 
as a sequence of rounds of (augmented) reduce steps. 

The MapReduce paradigm has a functional flavor, in that 
it merely requires that the algorithm designer decomposes 



the computation into rounds and, within each round, into 
independent tasks through the use of keys. This enables 
parallelism without forcing an algorithm to cater for the 
expUcit allocation of processing resources. Nevertheless, the 
paradigm implicitly posits the existence of an underlying 
unstructured and possibly heterogeneous parallel infrastruc- 
ture, where the computation is eventually run. While mostly 
ignoring the details of such an underlying infrastructure, ex- 
isting formalizations of the MapReduce paradigm constrain 
the computations to abide with some local and aggregate 
memory limitations. 

In this paper, we look at both modeling and algorithmic 
issues related to the MapReduce paradigm. We first provide 
a formal specification of the model, aimed at overcoming 
some limitations of the previous modeling efforts, and then 
derive interesting tradeoffs between memory constraints and 
round complexity for the fundamental problem of matrix 
multiplication and some of its apphcations. 

A. Previous work 

The MapReduce paradigm has been introduced in HI 
without a fully-specified formal computational model for 
algorithm design and analysis. Triggered by the popularity 
quickly gained by the paradigm, a number of subsequent 
works have dealt more rigorously with modeling and algo- 
rithmic issues f4l, f5l, fE]. 

In [4J, a MapReduce algorithm specifies a sequence of 
rounds as described in the previous section. Somewhat 
arbitrarily, the authors impose that in each round the memory 
needed by any reducer to store and transform its input 
pairs has size 0{n^^'^), and that the aggregate memory 
used by all reducers has size O (n^"^'^), where n denotes 
the input size and e is a fixed constant in (0, 1). The 
cost of local computation, that is, the work performed by 
the individual reducers, is not explicitly accounted for, but 
it is required to be polynomial in n. The authors also 
postulate, again somewhat arbitrarily, that the underlying 
parallel infrastructure consists of 8 {'n^~'^) processing el- 
ements with 8 {n^^'^) local memory each, and hint at a 



possible way of supporting the computational model on 
such infrastructure, where the reduce instances are scheduled 
among the available machines so to distribute the aggregate 
memory in a balanced fashion. It has to be remarked that 
such a distribution may hide non negligible costs for very 
fine-grained computations (due to the need of allocating 
multiple reducer with different memory requirements to a 
fixed number of machines) when, in fact, the algorithmic 
techniques of [4| do not fully explore the larger power of 
the MapReduce model with respect to a model with fixed 
parallelism. In [3 1 the same model of |4| is adopted but when 
evaluating an algorithm the authors also consider the total 
work and introduce the notion of work-efficiency typical of 
the literature on parallel algorithms. 

An alternative computational model for MapReduce is 
proposed in ||5l, featuring two parameters which describe 
bandwidth and latency characteristics of the underlying 
communication infrastructure, and an additional parameter 
that limits the amount of I/O performed by each reducer. 
Also, a BSP-like cost function is provided which combines 
the internal work of the reducers with the communication 
costs incurred by the shuffling of the data needed at each 
round. Unlike the model of |4|, no limits are posed to the 
aggregate memory size. This implies that in principle there 
is no limit to the allowable parallelism while, however, 
the bandwidth/latency parameters must somewhat reflect 
the topology and, ultimately, the number of processing 
elements. Thus, the model mixes the functional flavor of 
MapReduce with the more descriptive nature of bandwidth- 
latency models such as BSP Q, 10. 

A model which tries to merge the spirit of MapReduce 
with the features of data-streaming is the MUD model of 
j6^|, where the reducers receive their input key-value pairs 
as a stream to be processed in one pass using small working 
memory, namely polylogarithmic in the input size. A similar 
model has been adopted in |9|. 

MapReduce algorithms for a variety of problems have 
been developed on the aforementioned MapReduce variants 
including, among others, primitives such as prefix sums, 
sorting, random indexing 13 , and graph problems such as 
triangle counting fTOl minimum spanning tree, s-t connec- 
tivity, [4], maximal and approximate maximum matching, 
edge cover, minimum cut [3], and max cover I9j. Moreover 
simulations of the PRAM and BSP in MapReduce have 
been presented in H, Q. In particular, it is shown that 
a T-step EREW PRAM algorithm can be simulated by 
an O (T)-round MapReduce algorithm, where each reducer 
uses constant-size memory and the aggregate memory is 
proportional to the amount of shared memory required by 
the PRAM algorithm f4l. The simulation of CREW or 
CRCW PRAM algorithms incurs a further O (log„ (M/m)) 
slowdown, where m denotes the local memory size available 
for each reducer and AI the aggregate memory size |5|. 

All of the aforementioned algorithmic efforts have been 



aimed at achieving the minimum number of rounds, possibly 
constant, provided that enough local memory for the reducer 
(typically, sublinear yet polynomial in the input size) and 
enough aggregate memory is available. However, so far, to 
the best of our knowledge, there has been no attempt to 
fully explore the tradeoffs that can be exhibited for specific 
computational problems between the local and aggregate 
memory sizes, on one side, and the number of rounds, on 
the other, under reasonable constraints of the amount of total 
work performed by the algorithm. Our results contribute to 
filling this gap. 

Matrix multiplication is a building block for many prob- 
lems, including matching [11], matrix inversion lfT2ll . all- 
pairs shortest path |12), graph contraction (13], cycle detec- 
tion lfT4l . and parsing context free languages [15]. Parallel 
algorithms for matrix multiplication of dense matrices have 
been widely studied: among others, we remind lfT6l . ifTTll 
which provide upper and lower bounds exposing a tradeoff 
between communication complexity and processor memory. 
For sparse matrices, interesting results are given in [Tsl, 
|19| for some network topologies like hypercubes, in [20] 
for PRAM, and in |I2T1 for a BSP-like model. In particular, 
techniques in ifTTI . ||20 ! are used in the following sections for 
deriving efficient MapReduce algorithms. In the sequential 
settings, some interesting works providing upper and lower 
bounds are ll22]| . Il23l for dense matrix multiplication, and 
HH, ||25]| . Il26l for sparse matrix multiphcation. 

B. New results 

The contribution of this paper is twofold, since it targets 
both modeling and algorithmic issues. 

We first formally specify a computational model for 
MapReduce which captures the functional flavor of the 
paradigm by allowing a flexible use of parallelism. More 
specifically, our model generalizes the one proposed in [4] 
by letting the local and aggregate memory sizes be two 
independent parameters, m and M, respectively. Moreover 
our model makes no assumption on the underlying execution 
infrastructure, for instance it does not impose a bound on 
the number of available machines, thus fully decoupling the 
degree of parallelism exposed by a computation from the 
one of the machine where the computation will be eventu- 
ally executed. This decoupling greatly simplifies algorithm 
design, which has been one of the original objectives of the 
MapReduce paradigm. (In Section HIl we quantify the cost 
of implementing a round of our model on a system with 
fixed parallelism.) 

Our algorithmic contributions concern the study of at- 
tainable tradeoffs in MapReduce for several variants of the 
fundamental primitive of matrix multiplication. In particular, 
building on the well-established three-dimensional algorith- 
mic strategy for matrix multiplication 1161 . we develop upper 
and lower bounds for dense-dense matrix multiplication and 
provide similar bounds for deterministic and/or randomized 



algorithms for sparse-sparse and sparse-dense matrix mul- 
tiplication. The algorithms are parametric in the local and 
aggregate memory constraints and achieve optimal or quasi- 
optimal round complexity in the entire range of variability 
of such parameters. Finally, building on the matrix mul- 
tiplication results, we derive similar space-round tradeoffs 
for matrix inversion and matching, which are important by- 
products of matrix multiplication. 

C. Organization of the paper 

The rest of the paper is structured as follows. In Section HIl 
we introduce our computational model for MapReduce 
and describe important algorithmic primitives (sorting and 
prefix sums) that we use in our algorithms. Section |lll] 
deals with matrix multiplication in our model, presenting 
theoretical bounds to the complexity of algorithms to solve 
this problem. We apply these results in Section |IV] to derive 
algorithms for matrix inversion and for matching in graphs. 

II. Model definition and basic primitives 

Our model is defined in terms of two integral parameters 
M and m, whose meaning will be explained below, and is 
named MR(m, M). Algorithms specified in this model will 
be referred to as MR-algorithms. An MR-algorithm specifies 
a sequence of rounds: the r-th round, with r > 1 transforms 
a multiset W,- of key-value pairs into two multisets Wr+i 
and Or of key-value pairs, where Wr+i is the input of 
the next round (empty, if r is the last round), and Or is 
a (possibly empty) subset of the final output. The input 
of the algorithm is represented by Wi while the output 
is represented by \Jr>iOr, with U denoting the union of 
multisets. The universes of keys and values may vary at 
each round, and we let Ur denote the universe of keys of 
Wr- The computation performed by Round r is defined by a 
reducer function pr which is applied independently to each 
multiset Wr,k ^ Wr consisting of all entries in Wr with key 
k e Ur- 

Let n be the input size. The two parameters M and m 
specify the memory requirements that each round of an 
MR-algorithm must satisfy. In particular, let rrir.k denote 
the space needed to compute Pr{Wr,k) on a RAM with 
O (logn)-bit words, including the space taken by the input 
(i.e., mr.k > |W^r.fc|) and the work space, but excluding 
the space taken by the output, which contributes either to 
Or (i.e., the final output) or to Wr+i- The model imposes 
that rrir^k G O (to), for every r > 1 and k € Ur, 
that J2keu ^r,k G O (Af), for every r > 1, and that 
J2r>i^r = 0{M). The complexity of an MR-algorithm 
is the number of rounds that it executes in the worst case, 
and it is expressed as a function of the input size n and of 
parameters to and M. The dependency on the parameters 
TO and M allows for a finer analysis of the cost of an MR- 
algorithm. 



As in [4], we require that each reducer function runs in 
time polynomial in n. In fact, it can be easily seen that 
the model defined in |4| is equivalent to the MR(to, AI) 
model with to e O (n^~^) and M e O (n^^^'^), for 
some fixed constant e e (0, 1), except that we eliminate 
the additional restrictions that the number of rounds of an 
algorithm be polylogarithmic in n and that the number of 
physical machines on which algorithms are executed are 
8 (n^^'^), which in our opinion should not be posed at the 
model level. 

Compared to the model in fTl], our MR(to, M) model 
introduces the parameter M which limits the size of the 
aggregate memory required at each round, whereas in lIZTll 
this size is virtually unbounded. Moreover, the complexity 
analysis in MR(to, M) focuses on the tradeoffs between m 
and M, on one side, and the number of rounds on the other 
side, while in (2J] a more complex cost function is defined 
which accounts for the overall message complexity of each 
round, the time complexity of each reducer computation, and 
the latency and bandwidth characteristics of the executing 
platform. 

A. Sorting and prefix sum computations 

Sorting and prefix sum primitives are used in the algo- 
rithms presented in this paper. The input to both primitives 
consists of a set of n key- value pairs {i, ai) with < z < n 
and flj G S, where S denotes a suitable set. For sorting, a 
total order is defined over S and the output is a set of n 
key- value pairs (i, bi), where the bi's form a permutation 
of the fli's and < hi for each < i < n. For prefix 
sums, a binary associative operation is defined over S and 
the output consists of a collection of n pairs (i, bi) where 
hi — ao Q) ... ® Gi, for < i < n. 

By straightforwardly adapting the results in [5J to our 
model we have: 

Theorem 1. The sorting and prefix sum primitives for inputs 
of size n can be implemented in MR{m,M) with round 
complexity 

O (log„ n) , 

for M = Q (n). 

We remark that the each reducer in the implementation 
of the sorting and prefix primitives makes use of Q (m) 
memory words. Hence, the same round complexity can be 
achieved in a more restrictive scenario with fixed parallelism. 
In fact, our MR(m, M) model can be simulated on a 
platform with 8 (M/ni) processing elements, each with 
internal memory of size 8 (to), at the additional cost of one 
prefix computation per round. Therefore, O (log,„ n) can be 
regarded as an upper bound on the relative power of our 
model with respect to one with fixed parallelism. 

Goodrich (Tl] claims that the round complexities stated in 
Theorem[T]are optimal for any M = ^l (n) as a consequence 



of the lower bound for computing the OR of n bits on the 
BSP model [28 1. It can be shown that the optimality carries 
through to our model where the output of a reducer is not 
bounded by m. 

III. Matrix multiplication 

Let A and B be two ^/n x ^/n. matrices and let C = A-B. 
We use ai,j,bij and Ci,j, with < i,j < ^/n, to denote 
the entries of A, B and C, respectively. In this section we 
present upper and lower bounds for computing the product 
C in MR(m, Af). The algorithms we present envision the 
matrices as conceptually divided into submatrices of size 
■^m X ^/m, and we denote these matrices with Aij, Bij 
and Cij, respectively, for < i,j < yjnjm. Clearly, C^.j = 

All our algorithms exploit the following partition of the 
(n/mY^'^ products between submatrices (e.g., Ai^h ■ Bh,j) 
into ^J'nfm groups: group Gi, with < £ < ^/n/m, 
consists of products Ai^h ■ Bhj, for every < z, j < ^Jnjm 
and for h — (« + j + ^) mod yjn/m. Observe that each 
submatrix of A and B occurs exactly once in each group 

We focus our attention on matrices whose entries belong 
to a semiring (S*, ®, 0) such that for any a G 5 we have a© 
= 0, where is the identity for ©. In this setting, efficient 
matrix multiplication techniques such as Strassen's cannot 
be employed. Moreover, we assume that the inner products 
of any row of A and of any column of B with overlapping 
nonzero entries never cancel to zero, which is a reasonable 
assumption when computing over natural numbers or over 
real numbers with a finite numerical precision. 

In our algorithms, any input matrix X {X — A, B) is 
provided as a set of key-value pairs (fc; j, {i,j,Xi,j)) for all 
elements Xij ^ 0. Key kij represents a progressive index, 
e.g., the number of nonzero entries preceding Xij in the 
row-major scan of X. We call a y/n x y/ii matrix dense if 
the number of its nonzero entries is Q (n), and we call it 
sparse otherwise. We suppose that M is sufficiently large to 
contain the input and output matrices. In what follows, we 
present different algorithms tailored for the multiplication of 
dense-dense (Section IIII-Ab . sparse-sparse (Section IIII-Bl i. 
and sparse-dense matrices (Section Illl-Ct . We also derive 
lower bounds which demonstrate that our algorithms are 
either optimal or close to optimal (Section IIII-Db . and an 
algorithm for estimating the number of nonzero entries in 
the product of two sparse matrices (Section [lII-B4b . 

A. Dense-Dense Matrix Multiplication 

In this section we provide a simple, deterministic al- 
gorithm for multiplying two dense matrices, which will 
be proved optimal in Subsection IIII-DI The algorithm is 
a straightforward adaptation of the well-established three- 
dimensional algorithmic strategy for matrix multiplication 
of iflTl . lfT6l . however we describe a few details of its 



implementation in MR(m, M) since the strategy is also at 
the base of algorithms for sparse matrices. W.l.o.g. we may 
assume that m < 2n, since otherwise matrix multiplica- 
tion can be executed by a trivial sequential algorithm. We 
consider matrices A and B as decomposed into ^/m x y/m 
submatrices and subdivide the products between submatrices 
into groups as described above. 

In each round, the algorithm computes all products within 
K = min{A//7i, y/n/m} consecutive groups, namely, at 
round r > 1, all multiplications in Gi are computed, with 
{r — l)K < £ < rK. The idea is that in a round all submatri- 
ces of A and B can be replicated K times and paired in such 
a way that each reducer performs a distinct multiplication 
in y^[r-i)K<i<rKGi- Then, each reducer sums the newly 
computed product to a partial sum which accumulates all 
of the products contributing to the same submatrix of C 
belonging to groups with the same index modulo K dealt 
with in previous rounds. At the end of the ^pn j (K ^fm)- 
th round, all submatrix products have been computed. The 
final matrix C is then obtained by adding together the K 
partial sums contributing to each entry of C through a prefix 
computatiorQ. We have the following result. 

Theorem 2. The above MR(m^ M)-algorithm multiplies two 

y/n X dense matrices in 

rounds. 

Proof: The algorithm clearly complies with the memory 
constraints of MR(m, M) since each reducer multiplies two 
y/rn X y/m submatrices and the degree of replication is 
such that the algorithm never exceeds the aggregate memory 
bound of M. Also, the {njvrif'l'^ products are computed in 
r?!"^ I (My/m) rounds, while the final prefix computation 
requires O (logm K = O (log„ n) rounds ■ 
We remark that the multiplication of two ^fn x ^pn dense 
matrices can be performed in a constant number of rounds 
whenever m = (n^), for constant e > 0, and Aly/m — 

B. Sparse-Sparse Matrix Multiplication 

Consider two ^/n x ^/n sparse matrices A and B and 
denote with n < n the maximum number of nonzero entries 
in any of the two matrices, and with 6 the number of 
nonzero entries in the product C = A-B. Below, we 
present two deterministic MR-algorithms (Dl and D2) and 
a randomized one (Rl), each of which turns out to be more 
efficient than the others for suitable ranges of parameters. 
We consider only the case m < 2h, since otherwise matrix 

' The details of the key assignments needed to perform the necessary data 
redistiibutions among reducers are tedious but straightforward, and will be 
provided in the full version of this abstract. 



multiplication can be executed by a trivial one-round MR- 
algorithm using only one reducer We also assume that the 
value h is provided in input. (If this were not the case, such 
a value could be computed with a simple prefix computation 
in O (log„j n) rounds, which does not affect the asymptotic 
complexity of our algorithms.) However, we do not assume 
that o is known in advance since, unlike n, this value cannot 
be easily computed. In fact, the only source of randomization 
in algorithm Rl stems from the need to estimate o. 

1) Deterministic algorithm Dl: This algorithm is based 
on the following strategy adapted from |20|. For < i < 
y/n, let ai (resp., hi) be the number of nonzero entries in 
the ith column of A (resp., ith row of B), and let Vi be 
the set containing all nonzero entries in the ith column of A 
and in the ith row of B. It is easily seen that all of the aibi 
products between entries in Ti (one from A and one from B) 
must be computed. The algorithm performs a sequence of 
phases as follows. Suppose that at the beginning of Phase t, 
with t > {), all products between entries in F,, for each 
i < r — 1 and for a suitable value r (initially, r = 0), have 
been computed and added to the appropriate entries of C. 
Through a prefix computation. Phase t computes the largest 
Kt such that X^jif^' '^j^j — Then, all products between 
entries in Fj, for every r < j < r + Kt, are computed using 
one reducer with constant memory for each such product. 
The products are then added to the appropriate entries of C 
using again a prefix computation. 

Theorem 3. Algorithm Dl multiplies two sparse ^fn x ^/n 
matrices with at most h nonzero entries each in 



O 



nminjn, \/n\ 
M 



log„M 



rounds, on an MR{m, M). 



Proof: The correctness is trivial and the memory con- 
straints imposed by the model are satisfied since in each 
phase at most M elementary products are performed. The 
theorem follows by observing that the maximum number 
of elementary products is nminjn, y/n] and that two con- 
secutive phases compute at least M elementary products in 
O (log^ M) rounds. ■ 
2) Deterministic algorithm D2: The algorithm exploits 
the same three-dimensional algorithmic strategy used in the 
dense-dense case and consists of a sequence of phases. In 
Phase t, t > 0, all y/m x -^m-size products within Kt 
consecutive groups are performed in parallel, where Kt is 
a phase-specific value. Observe that the computation of all 
products within a group Gi requires space Mi E [n,h + o], 
since each submatrix of A and B occurs only once in 
Gf and each submatrix product contributes to a distinct 
submatrix of C. However, the value M( can be determined 
in 9 (n) space and O (logm n) rounds by "simulating" 
the execution of the products in Ge (without producing 
the output values) and adding up the numbers of nonzero 



entries contributed by each product to the output matrix. 
The value Kt is determined as follows. Suppose that, at 
the beginning of Phase t, groups Gi have been processed, 
for each £ < r — 1 and for a suitable value r (initially, 
r = 0). The algorithm replicates the input matrices K't = 
min{Af/n, y/n/m} times. Subsequently, through sorting 
and prefix computations the algorithm computes Mi for each 
r < £ < r + Kj. and determines the largest Kt < K[ such 
that X^fif^' -^^t — Then, the actual products in Gi, for 
each r < £ < r + Kt are executed and accumulated (again 
using a prefix computation) in the output matrix G. We have 
the following theorem. 

Theorem 4. Algorithm D2 multiplies two sparse ^/n x ^pn 
matrices with at most h nonzero entries each in 



O 



{h + d)y/n 



M 



Ids™ M 



rounds on an MR{'m, M), where 5 denotes the maximum 
number of nonzero entries in the output matrix. 

Proof: The correctness of the algorithm is trivial. 
Phase t requires a constant number of sorting and prefix 
computations to determine Kt and to add the partial con- 
tributions to the output matrix C. Since each value Mi 
is O (n + o) and the groups are ^Jn/m, clearly, Kt = 
Vt ^min{M/(7T, + o), ^Jn/m\^, and the theorem follows. ■ 

We remark that the value o appearing in the stated round 
complexity needs not be explicitly provided in input to the 
algorithm. We also observe that with respect to Algorithm 
Dl, Algorithm D2 features a better exploitation of the 
local memories available to the individual reducers, which 
compute y/rri x y^-size products rather than working at the 
granularity of the single entries. 

By suitably combining Algorithms Dl and D2, we can 
get the following result. 

Corollary 1. There is a deterministic algorithm which 
multiplies two sparse ^/n x matrices with at most n 
nonzero entries each in 



O 



M 



log™ M 



rounds on an MR{m^ M), where 5 denotes the maximum 
number of nonzero entries in the output matrix. 

3) Randomized algorithm Rl: Algorithm D2 requires 
O (log„j M) rounds in each Phase t for computing the 
number Kt of groups to be processed. However, if 5 were 
known, we could avoid the computation of Kt and resort to 
the fixed-if strategy adopted in the dense-dense case, by pro- 
cessing K = M / [n + o) consecutive groups per round. This 
would yield an overall O {{n + o)y/n/ {M^/rn) + log^ M) 
round complexity, where the log,„ M additive term accounts 
for the complexity of summing up, at the end, the K 



contributions to each entry of C. However, o may not be 
known a priori. In this case, using the strategy described in 
Section IIII-B4I we can compute a value o which is a 1/2- 
approximation to o with probability at least 1 — 1/n. (We 
say that o e-approximates o if |o — o| < eo.) Hence, in the 
algorithm we can plug in 2o as an upper bound to o. By using 
the result of Theorem |6] with e = 1/2 and 5 — l/(2n), we 
have: 

Theorem 5. Let m = Vt (log^n). Algorithm Rl multiplies 
two sparse \/n x ^/n matrices with at most h nonzero entries 
in 

„ , in -\- o)^/n 



rounds on an MR{m, M), with probability at least 1 — 

By comparing the rounds complexities stated in Corol- 
lary [T] and Theorem |5] it is easily seen that the random- 
ized algorithm Rl outperforms the deterministic strategies 
when m e (il (log'^ n) ,o(Af'^)), for any constant e, n > 
yjnjml log„j M, and b < h min{7i, \frri\ log„j M. For 
a concrete example, Rl exhibits better performance when 
h > yjn, o = 8 (n), and m is polylogarithmic in M. 
Moreover, both the deterministic and randomized strategies 
can achieve a constant round complexity for suitable values 
of the memory parameters. 

4) Evaluation of o: Observe that a y^-approximation to 
o derives from the following simple argument. Let at and hi 
be the number of nonzero entries in the ith column of A and 
in the ith row of B respectively, for each < i < -/n. Then, 

5 < X^i^cT^ '^i^i — Oy/n. Evaluating the sum requires 0(1) 
sorting and prefix computations, hence a -^n-approximation 
of o can be computed in O (log,„ n) rounds. However, such 
an approximation is too weak for our purposes and we show 
below how to achieve a tighter approximation by adapting 
a strategy born in the realm of streaming algorithms. 

Let e > and < (5 < 1 be two arbitrary values. An e- 
approximation to o can be derived by adapting the algorithm 
of |29J for counting distinct elements in a stream xqXi . . ., 
whose entries are in the domain [n] = {0, . . . , n — 1}. 
The algorithm of 1*291 makes use of a very compact data 
structure, customarily called sketch in the literature, which 
consists of A = 0(log(l/(5)) lists, Li, L2, ■ ■ ■ , La- For 

< w < A, Lw contains the t = 6([l/e^]) distinct 
smallest values of the set {(j>w{xi) : i > 0}, where 
(j)^ : [n] [n^] is a hash function picked from a pairwise 
independent family. It is shown in |29| that the median of 
the values tn? /vq^ . . . tn? /v/^-i, where denotes the tth 
smallest value in L^, is an e-approximation to the number 
of distinct elements in the stream, with probability at least 

1 — (5. In order to compute an e-approximation of o for a 
product C ^ A - B of ^/n x ^/n matrices, we can modify the 
algorithm as follows. Consider the stream of values in [n] 
where each element of the stream corresponds to a distinct 



product aijibh.j 7^ and consists of the value j + iy/n. 
Clearly, the number of distinct elements in this stream is 
exactly 6. (A similar approach has been used in |30| in the 
realm of sparse boolean matrix products.) We now show how 
to implement this idea on an MR(to, M). 

The MR-algorithm is based on the crucial observation 
that if the stream of values defined above is partitioned into 
segments, the sketch for the entire stream can be obtained 
by combining the sketches computed for the individual seg- 
ments. Specifically, two sketches are combined by merging 
each pair of lists with the same index and selecting the t 
smallest values in the merged list. The MR(m, Af)-algorithm 
consists of a number of phases, where each phase, except 
for the last one, produces set of M/m sketches, while the 
last phase combines the last batch of Mjm sketches into 
the final sketch, and outputs the approximation to 6. 

We refer to the partition of the matrices into ^Jm x ^Jm 
submatrices and group the products of submatrices as done 
before. In Phase t, with i > 1, the algorithm processes the 
products m K = min{A//ri, yjn/m\ consecutive groups, 
assigning each pair of submatrices in one of the K groups 
to a distinct reducer A reducer receiving Ai,h and Bh.j, each 
with at least a nonzero entry, either computes a sketch for 
the stream segment of the nonzero products between entries 
of Ai,h and Bhj, if the total number of nonzero entries 
of Ai^h and Bh,j exceeds the size of the sketch, namely 
H = O ((1/e^) log(l/(5)) words, or otherwise leaves the 
two submatrices untouched (observe that in neither case the 
actual product of the two submatrices is computed). In this 
latter case, we refer to the pair of (very sparse) submatrices 
as a pseudosketch. At this point, the sketches produced by 
the previous phase (if t > 1), together with the sketches and 
pseudosketches produced in the current phase are randomly 
assigned to M/m reducers. Each of these reducers can now 
produce a single sketch from its assigned pseudosketches (if 
any) and merge it with all other sketches that were assigned 
to it. In the last phase {t = yjnjml K) the Mjm sketches 
are combined into the final one through a prefix computation, 
and the approximation to o is computed. 

Theorem 6. Let ((1/e^) log(l/(5) log(n/(5)) and let 

e > and Q < b < \ be arbitrary values. Then, with 
probability at least 1 — 2(5, the above algorithm computes 
an e-approximation to o in 



M 



rounds, on an MR{'m, M) 



Proof: The correctness of the algorithm follows from 
the results of [29] and the above discussion. Recall that the 
value computed by the algorithm is an e-approximation to 
o with probability 1 — (5. As for the rounds complexity we 
observe that each phase, except for the last one, requires 
a constant number of rounds, while the last one involves a 



prefix computation thus requiring O (logm M) rounds. We 
only have to make sure that in each phase the memory con- 
straints are satisfied (with high probabiHty). Note also that 
a sketch of size H < m is generated either in the presence 
of a pair of submatrices Aiji, Bh.j containing at least H 
entries, or within one of the M/m reducers. By the choice 
of K, it is easy to see that in any case, the overall memory 
occupied by the sketches is O (M). As for the constraint on 
local memories, a simple modification of the standard balls- 
into-bins argument ||3T| and the union bound suffices to show 
that with probability 1 — 5, in every phase when sketches and 
pseudosketches are assigned to M/m reducers, each reducer 
receives in O (m + (1/e^) log(l/(5) log(n/(5)) = O (m) 
words. The theorem follows. (More details will be provided 
in the full version of the paper) ■ 

C. Sparse-Dense matrix multiplication 

Let A be a sparse ^Jn x -y/n matrix with at most n 
nonzero entries and let i? be a dense ^/n x -y/n matrix 
(the symmetric case, where A is dense and B sparse, is 
equivalent). The algorithm for dense-dense matrix multi- 
plication does not exploit the sparsity of A and requires 
O (n\fnl (My/m) + log.^ n) rounds. Also, if we simply 
plug n = ri in the complexities of the three algorithms for 
the sparse-sparse case (where h represented the maximum 
number of nonzero entries of A or B) we do not achieve 
a better round complexity. However, a careful analysis 
of algorithm Dl in the sparse-dense case reveals that its 
round complexity is O {\h^/n/AI~\ log„ M). Therefore, by 
interleaving algorithm Dl and the dense-dense algorithm we 
have the following corollary. 

Corollary 2. The multiplication on MR{m, M) of a sparse 
i/n X ^Jn matrix with at most h nonzero entries and of 
a dense y/n x ^Jn matrix requires a number of rounds 
which is the minimum between O {\hy/n/ logm M) and 
O {n^/{M^) + log„ n). 

Observe that the above sparse-dense strategy outper- 
forms all previous algorithms for instance when n — 

o{n/{y^\og^M)). 

D. Lower bounds 

In this section we provide lower bounds for dense-dense 
and sparse-sparse matrix multiplication. We restrict our at- 
tention to algorithms which perform all nonzero elementary 
products, that is, on conventional matrix multiplication |16|. 
Although this assumption limits the class of algorithms, 
ruling out Strassen-like techniques, an elaboration of a 
result in (23) shows that computing all nonzero elementary 
products is necessary when entries of the input matrices 
are from the semirings (N, +,•) and (N U {cx)}, min, +)0 

^The (N U {oo}, min, +) semiring, where oo is the identity of the min 
operation, is usually adopted while computing the shortest path matrix of 
a graph given its connection matrix. 



Indeed, we have the following lemma which provides a 
lower bound on the number of products required by an 
algorithm multiplying any two matrices of size ^/ri x y/n, 
containing ua and hs nonzero entries and where zero 
entries have fixed positions (a similar lemma holds for 
(N U {oo}, min, +)). As a consequence of the lemma, an 
algorithm that multiplies any two arbitrary matrices in the 
semiring (N, + , •) must perform all nonzero products. 

Lemma 1. Consider an algorithm A which multiples two 
^/n X ^/n matrices A and B with fiA and fiB nonzero 
entries, respectively, from the semiring (N, +, •) and where 
the positions of zero entries are fixed. Then, algorithm A 
must perform all the nonzero elementary products. 



Proof: 11231 shows that each c^.j can be computed only 
by summing all terms a,;./i • with Q < h < ^Jn, if the 
algorithm uses only semiring operations. The proof relies on 
the analysis of the output for some suitable input matrices, 
and makes some assumptions that force the algorithm to 
compute even zero products. However, the result still holds 
if we allow all the zero products to be ignored, but some 
adjustments are required. In particular, the input matrices 
used in f2j| do not work in our scenario because may 
contain less than fiA and ns nonzero entries, however it 
is easy to find inputs with the same properties working in 
our case. More details will be provided in the full version. 

■ 

The following theorem exhibits a tradeoff in the lower 
bound between the amount of local and aggregate memory 
and the round complexity of an algorithm performing con- 
ventional matrix multiplication. The proof is similar to the 
one proposed in fl6 1 for lower bounding the communication 
complexity of dense-dense matrix multiplication in a BSP- 
like model: however, differences arise since we focus on 
round complexity and our model does not assume the 
outdegree of a reducer to be bounded. In the proof of the 
theorem we use the following lemma which was proved 
using the red-blue pebbUng game in ll22l and then restated 
in 1 16] as follows. 

Lemma 2 ( ||I6| ). Consider the conventional matrix multipli- 
cation C = A-B, where A and B are two arbitrary matrices. 
A processor that uses Na elements of A, Nb elements of B, 
and contributes to Nq elements of C can compute at most 
(NaNsNc)^^^ multiplication terms. 

Theorem 7. Consider an MR{m, M)-algorithm A for multi- 
plying two y/ny. ^Jn matrices containing at most fiA andfiB 
nonzero entries, using conventional matrix multiplications. 
Let P and o denote the number of nonzero elementary 
products and the number of nonzero entries in the output 
matrix, respectively. Then, the round complexity of A is 



P 



MJm 



log„ 



Proof: Let A be an i?-round MR(to, Af )-algorithm 
computing C ^ A - B.Wt prove that R = U. {P/{M \/m)). 
Consider the r-th round, with 1 < r < R, and let k 
be an arbitrary key in Ur and Kr = \Ur\- We denote 
with Or.k the space taken by the output of pr{Wr,k) which 
contributes either to Or or to Wr+i, and with rrir.k the 
space needed to compute Pr{Wr.k) including the input and 
working space but excluding the output. Clearly, m^./c < m. 



Ek 



eu. 



rrir^k < M, and J2k 



eu,. 



Or,k 



<5< M. 



Suppose M/Kr > m. By Lemma |2l the reducer pr 
with input Wr,k can compute at most m^Jo^ elementary 
products since NA^ Nb < m and Nc < Or^k, where Na and 
Nb denote the entries of A and B used in Pr{Wr.k) and Nc 
the entries of C for which contributions are computed by 
Pr{Wr,k)- Then, the number of terms computed in the r-th 

/Or^k < m V MK-r < 



round is at most J2keu ™\/Or,/£ < m\/MKr < M^m, 
since iiT^ < M/m and the summation is maximized when 
Or,k = M/Kr for each A; G C/r- 

Suppose now that M/Kj. < m. Partition the keys in Ur 
into K'j, sets 5*0, . . . Sk'^-i such that m < X)fees ™r,/c < 
2™ for each < j < if', (the lower bound may be 
not satisfied for j = K'^ - 1). Cleai'ly, [M/2mJ < 
-f'^r — iM/m]. By Lemma |2] the number of elementary 
products computed by all the reducers Pr{Wr^k) with keys 
in a set Sj is at most '^i.^g .{''^r,kmr,kOr,ky^^- Since 
(xyz)i/2 + (x'y'z')'/' < {{x + x'){y + y'){z + z')fl^ 
for each non negative assignment of the x^y, z^x' ,y' , z' 
variables and since X^fces ''^r,k < 2to, it follows that 
at most 2myJOr.j elementary products can be computed 
using keys in Sj, where Or.j — J2keS- '^r.k- Therefore, the 
number of elementary products computed in the r-th round 
is at most YlfZo^^my^O^ < 2m^MK[. < 2M\/2to, 
since K'^ < \M/m] and the sum is maximized when 
Orj = M/Kl. for each < j < K^.. 

Therefore, in each round O (M ^ym.) nonzero elementary 
products can be computed, and then R — ft ([P/M-^to]). 
The second term of the lower bound follows since there is 
at least one entry of C given by the sum of P/d nonzero 
elementary products. ■ 

We now specialize the above lower bound for algorithms 
for generic dense-dense and sparse-sparse matrix multipli- 
cation. 

Corollary 3. An MR{m, M)-algorithm for multiplying any 
two dense ^/n x y/n matrices, using conventional matrix 
multiplication, requires 

( TV'I^ 



rounds. On the other hand, an MR{m, M)-algorithm for 
multiplying any two sparse matrices with at most h nonzero 
entries requires 

nmin{n, \/n\ 



M 



rounds. 

Proof: In the dense-dense case the lower bound follows 
by the above Theorem |7] since we have P — r?!'^ and 5 = n 
when fiA = fiB = n. In the sparse-sparse case, we set 
fiA = fiB = fi and we observe that there exist assignments 
of the input matrices for which P ^ h min{n, \/n}, and 
others where P/o = il{n) ■ 
The deterministic algorithms for matrix multiplication 
provided in this section perform conventional matrix multi- 
plication, and hence the above corollary applies. Thus, the 
algorithm for dense-dense matrix multiplication described in 
Section IIII-AI is optimal for any value of the parameters. 
On the other hand, the deterministic algorithm D2 for 
sparse-sparse matrix multiplication given in Section iriI-B2l is 
optimal as soon as n > ^/n, 5 = (n) and m is polynomial 
in M. 

IV. Applications 

Our matrix multiplications results can be used to derive 
efficient algorithms for inverting a square matrix and for 
solving several variants of the matching problem in a graph. 
The algorithms in this section make use of division and 
exponentiation. To avoid the intricacies of dealing with 
limited precision, we assume each memory word is able to 
store any value that occurs in the computation. A similar 
assumption is made in the presentation of algorithms for the 
same problems on other parallel models (see e.g. fT2l|). 

A. Inverting a lower triangular matrix 

In this section we study the problem of inverting a lower 
triangular matrix A of size ^/n x ^/n. We adopt the simple 
recursive algorithm which leverages on the easy formula for 
inverting a 2 x 2 lower triangular matrix |12, Sect. 8.2]. We 
have 

' r, n 1 r ^-1 n 

(1) 

For < A: < (1/2) log(n/m) and < i,j < 2^, let 

be the submatrix resulting from the splitting of A into 

submatrices of size {-s/n/2^) x {y/n/2'^). Since Equation ([U 

holds even when a, 6, c are matrices, we have that (^A'f'^^ 
can be expressed as in Equation (|2|i in Figure [T] Note that 

A-^ = (An' ' 
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—c^^ba^^ cr^ 



The MR(to, A/)-algorithm for computing the inverse 
of A works in (1/2) log(n/m) phases. Let Vr = 
(l/2)log(n/m) - r for < r < (1/2) log(n/m). In the 
first part of Phase 0, the inverses of all the lower triangular 
submatrices A^^'°\ with < ^ < m, are computed 



in parallel. Since each submatrix has size y/m x y/m, each 
inverse can be computed sequentially within a single reducer. 
In the second part of Phase 0, each product 
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Figure 1 . Equation ^ - Expression for ( A 
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for < w < (l/2)-\/n/m, is computed within a reducer. 
In Phase r, with 1 < r < (1/2) log(n/m), each term 

\-^2w+l,2w + l J ^2w + l,2w I ^2u)+l,2t« + l I ' 

for < w < 2'"''+'^, is computed in parallel by per- 
forming two matrix multiplications using M/2'"''+^ aggre- 
gate memory and local size m. Therefore, at the end of 
Phase (1/2) log(7i/TO) — 1 we have all the components of 

(<))"\i.e.,of A-i. 

Theorem 8. The above algorithm computes the inverse of 
a nonsingular lower triangular ^/n x ^/n matrix A in 



O 



i3/2 log 



^ M yjrn log m 

rounds on an MR{m, M). 

Proof: The correctness of the algorithm follows from 
the correctness of (|2|i which in turns easily follows from the 
correctness of the formula to invert a lower triangular 2x2 
matrix. From the above discussion it easy to see that the 
memory requirements are all satisfied. 

We now analyze the round complexity of the algorithm. 
At Phase r we have to compute 2^+"'+i = (1/2)'' -^/n/m 
products between matrices of size ^JnjT'^ x = 
2"^ ^/my. 2^y/m. Each product is computed in parallel by us- 
ing M/2'"''+^ — A/2''+^-\/r7i/n > 1 aggregate memory and 
thus each Phase r requires O {l?"^ y/mnjM + log„j(2''TO)) 
rounds by using the algorithm described in Section IIII-AI 
The cost of the lower triangular matrix inversion algorithm 
is then 



(2'-™)) 



which gives the bound stated in the theorem. ■ 

If M^/rn is Vl (jv'/'^^ and m = 57 [rf) for some constant 
e, the complexity reduces to O (log n) rounds, which is a 
logarithmic factor better than what could be obtained by 
simulating the PRAM algorithm. 

It is also possible to compute A^^ using the closed for- 
mula derived by unrolling a blocked forward substitution. In 
general, the closed formula contains an exponential number 
of terms. There are nonetheless special cases of matrices 
for which a large number of terms in the sum are zero 
and only a polynomial number of terms is left. This is, for 




instance, the case for triangular band matrices. (Note that 
the inverse of a triangular band matrix is triangular but not 
necessarily a triangular band matrix.) If the width of the 
band is O (771 log n), then we have a polynomial number of 
terms in the formula. In this case we can do matrix inversion 
in constant rounds for sufficiently large values of m and M . 
A complete discussion of this method will be presented in 
the full version of the paper. 

B. Inverting a general matrix 

Building on the inversion algorithm for triangular matrices 
presented in the previous subsection, and on the dense-dense 
matrix multiplication algorithm, in this section we develop 
an MR(7T7, Af )-algorithm to invert a general -/nx ^Jn matrix 
A. Let the trace ^^(A) of A be defined as X^iLo^ where 
ai_i denotes the entry of A on the i-th row and i-th column. 
The algorithm is based on the following known strategy (see 
e.g.. Gal Sect. 8.8]). 

1) Compute the powers A^, . . . , A^~^. 

2) Compute the traces Sk = for 1 < fc < 

3) Compute the coefficients Ci of the characteristic poly- 
nomial of A by solving a lower triangular system 
of ^/n linear equations involving the traces Sk (the 
system is shown below). 

4) Compute A-^ = -(1/co) J2^i c^A^^^ . 

We now provide more details on the MR implementation 
of above strategy. The algorithm requires M — $7 (77^/^), 
which ensures that enough aggregate memory is available to 
store all the ^Jn powers of A. In Step [T] the algorithm com- 
putes naively the powers in the form A? , \ < i < log ^Jn, 
by performing a sequence of log ^Jn matrix multiplications 
using the algorithm in Section IIII-AI Then, each one of 
the remaining powers is computed using Mj %/n > n 
aggregate memory and by performing a sequence of at most 
log -/n multiplications of the matrices A^ obtained earlier. 
In Step 12] the ^Jn traces Sk are computed in parallel using 
a prefix like computation, while the coefficients q of the 
characteristic polynomial are computed in Step [3] by solving 
the following lower triangular system: 
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If we denote with L the matrix on the left hand side, with C 
the vector of unknowns, and with S the vector of the traces 
on the right hand side, we have C = —L'^^S. In order to 
compute the coefficients in C the algorithm inverts the y/n x 
y/n lower triangular matrix L as described in Section |TV-A| 
and computes the product between L^^ and S, to obtain C. 
Finally, Step |4] requires a prefix like computation. We have 
the following theorem. 

Theorem 9. The above algorithm computes the inverse of 
any nonsingular ^/n x ^/n matrix A in 

^ / log n ^ \o^_n\ 
\ My/ra logm / 

rounds on MR{m,M), with M = fl (n^/^). 

Proof: For the correctness of the algorithm see lfT2l 
Sect. 8.8]). It is easy to check that the memory requirements 
of the MR(r7i, M) model are satisfied. We focus here on 
analyzing the round complexity. 

Computing the powers in the form , 1 < i < 
log requires O (n^/^ log n/{My^) + (log^ n)/ logm) 
rounds, since the algorithm performs a sequence of 
logy^ products. The remaining powers are computed 
in O (n^ log n/ {M \/rn) + (log^ n) / (log to)) rounds since 
each power is computed by performing at most logy^ 
product using AI /y/n aggregate memory. The prefix 
like computation for finding the y/n traces Sk re- 
quires O {logm rounds, while the linear system takes 
0(n3/2/(MVTO) + (log^n)/(logTO)) rounds. The final 
step takes O (log„ n) rounds using a prefix like computa- 
tion. The round complexity in the statement follows. ■ 

If My/m is ri [n^logn) and to = ^{rf) for some 
constant e, the complexity reduces to O (log n) rounds, 
which is a quadratic logarithmic factor better than what 
could be obtained by simulating the PRAM algorithm. 

C. Approximating the inverse of a matrix 

The above algorithm for computing the inverse of any 
nonegative matrix requires M = O (n^/^). In this section 
we provide an MR(to, A/)-algorithm providing a strong 
approximation of A^^ assuming M — 57 (n). A matrix B is 
a strong approximation of the inverse of an y/n x y/n matrix 
A if < 2-"^ for some constant c> 0. 

The norm of a matrix A is defined as 

= max||^x||2/||x||2 

where || • II2 denotes the Euclidean norm of a vector The 
condition number k,{A) of a matrix A is defined as n{A) = 
ll^llll^-'ll- 

An iterative method to compute a strong approximation 
of the inverse of a y/n x y/n matrix A is proposed in lfT2l 
Sect. 8.8.2]. The method works as follows. Let Bq be a 
^Jn X ^Jn matrix satisfying the condition ||/^ — i3o^|| = q 



for some < g < 1 and where is the y/n x y/n identity 
matrix. For a y/n x y/n matrix C let r{C) = — CA. We 
define Bk = (/^ + r{Bk-i))Bk-i, for fc > 0. We have 

By setting Bq — aA^ where a = 
ma,x,{J2]^o^ \ai,j\}maxj{J2t^(r^ \ai,j\}, we have 
q = I - l/{K{A)'^n) Ea. Then, if k{A) = O {n") for 
some constant c > 0, Bk provides a strong approximation 
when k = & (logn). From the above discussion, it is easy 
to derive an efficient MR(m, Af)-algorithm to compute a 
strong approximation of the inverse of a matrix using the 
algorithm for dense matrix multiplication in Section IIII-AI 

Theorem 10. The above algorithm provides a strong ap- 
proximation of the inverse of any nonegative y/nx yfn matrix 
A in 

^(rt'l^Xogn log^ri\ 
\ Myfm logm / 

rounds on an MR{'m, M) when k{A) — O (n^) for some 
constant c > 0. 

Proof: The correctness of the algorithm derives 
from 1 12 1. Once again we only focus on the round complex- 
ity of the algorithm. Computing a requires a a constant num- 
ber of prefix like computations, and hence takes O (log„ n) 
rounds. To compute Bk, k > from Bk-i, we need the 
value r{Bk-i) which involves a multiplication between two 
^/n X y/n matrices and a subtraction between two matrices. 
Hence, each phase requires O (n'^/^/(Af-yTO) + log„ n) 
rounds. Since the algorithm terminates when k — Q (log n), 
the theorem follows. ■ 

D. Matching of general graphs 

A strategy for computing, with probability at least 1/2, a 
perfect matching of a general graph using matrix inversion 
is presented in ifTTI . The strategy is the following: 

1) Let the input of the algorithm be the adjacency matrix 
A of a graph G ~ (V, E) with y/n vertices and k 
edges. 

2) Let B be the matrix obtained from A by substituting 
the entries j = aj_i = 1 corresponding to edges in 
the graph with the integers 2"'' ^ and — 2™' J respec- 
tively, for < i < j < y/n, where Wi,j is an integer 
chosen independently and uniformly at random from 
[l,2fc]. We denote the entry on the ith row and jth 
column of B as . 

3) Compute the determinant det{B) of B and the greatest 
integer w such that 2"' divides det{B). 

4) Compute adj{B), the adjugate matrix of B, and 
denote the entry on the ith row and jth column as 
adj{B)ij. 



5) For each edge {vi, Vj) e E, compute 

_ bi.j ■ adj{B)ij 

If is odd, then add the edge {vi,Vj) to the 
matching. 

An MR(m, M)-algorithm for perfect matching easily fol- 
lows by the above strategy. We now provide more details on 
the MR implementation which assumes AI = Q (n'^/'^). 

In Step |2] i? is obtained as follows. The algorithm 
partitions A into square y/m x y/m submatrices Ag^h, 

< i,h < ^njra, and then assigns each pair of subma- 
trices {Ai,h, Ah^i) to a different reducer This assignment 
ensures that each pair of entries {aij,aj^i) of A is sent 
to the same reducer. Consider now the reducer receiving 
the pair of submatrices {Agji,Afi^i) and consider the set 
of pairs {aij,aj^i) of A such that Uij = aj^i ~ 1, where 
£y/rn < i < {i + l)y/m, h^/m < j < {h + l)y/rn, and 

1 < j. For each of these pairs the reducer chooses a Wij 
independently and uniformly at random from [l,2fc], and 
sets bi,j to 2™' J and 6j ^ to —2™' 3 . For all the other entries 
aij = aj,i = 0, the reducer sets bij — hj^i = 0. 

Let Cfc, < k < y/n be the coefficients of the char- 
acteristic polynomial of B, which can be computed as 
described in Section IIV-BI Steps [3] and |4] can be easily 
implemented since the determinant of B is cq and adj{B) = 
-{cil + C2B + + . . . + c^B^-i). 

Finally, in Step |5] matrices B and adj [B] are partitioned 
in square submatrices of size -/m x -^m, and corresponding 
submatrices assigned to the same reducer, which computes 
the values j for the entries in its submatrices and outputs 
the edges belonging to the matching. 

Theorem 11. The above algorithm computes, with proba- 
bility at least 1/2, a perfect matching of the vertices of a 
graph G, in 

^ / log n _^ log^n \ 
\ Myfm logm / 

rounds on MR{m, M), where M — Q (n'^/^). 

Proof: The correctness of the algorithm follows from 
the correctness of ifTTIl and it is easy to see that the memory 
requirements of the MR(m, M) model are satisfied. We 
focus here on the round complexity. From the above de- 
scription, it is easy to see that the computation of B and the 
Wij's in Step |2] only takes one round. Steps [3] and |4] require 
the computation of the coefficients of the characteristic 
polynomial of B, and so takes a number of rounds equal 
to the algorithm for matrix inversion described in Sec- 
tionH^H i.e., O {{n^ \ogn) / [M y/m) + (log^ n)/(logm)). 
Step |5] takes one round. Since the round complexity is 
dominated by the number of rounds needed to compute 
the coefficients of the characteristic polynomial of B, the 
theorem follows. ■ 



We note that matching is as easy as matrix inversion In 
the MR(m, M) model. The above result can be extend to 
minimum weight perfect matching, to maximum matching, 
and to other variants of matching in the same way as in ifTTl 
Sect. 5]. 

V. Conclusions 

In this paper, we provided a formal computational model 
for the MapReduce paradigm which is parametric in the 
local and aggregate memory sizes and retains the functional 
flavor originally intended for the paradigm, since it does 
not require algorithms to explicitly specify a processor 
allocation for the reduce instances. Performance in the 
model is represented by the round complexity, which is 
consistent with the idea that when processing large data 
sets the dominant cost is the reshuffling of the data. The 
two memory parameters featured by the model allow the 
algorithm designer to explore a wide spectrum of trade- 
offs between round complexity and memory availability. 
In the paper, we covered interesting such tradeoffs for the 
fundamental problem of matrix multiplication and some of 
its applications. The study of similar tradeoffs for other 
important applications (e.g., graph problems) constitutes an 
interesting open problem. 

Acknowledgments 

The work of Pietracaprina, Pucci and Silvestri was sup- 
ported, in part, by MIUR of Italy under project AlgoDEEP, 
and by the University of Padova under the Strategic Project 
STPD08JA32 and Project CPDA099949/09. The work of 
Riondato and Upfal was supported, in part, by NSF award 
IIS-0905553 and by the University of Padova through the 
Visiting Scientist 2010/2011 grant. 

References 

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data 
processing on large clusters," Communications of the ACM, 
vol. 51, no. 1, pp. 107-113, 2008. 

[2] J. Lin and C. Dyer, Data-Intensive Text Processing with 
MapReduce. Morgan & Claypool, 2010. 

[3] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii, "Filter- 
ing: a method for solving graph problems in MapReduce," 
in Proc. of the 23rd ACM Symp. on Parallel Algorithms and 
Architectures, 2011, pp. 85-94. 

[4] H. Karloff, S. Suri, and S. Vassilvitskii, "A model of com- 
putation for MapReduce," in Proc. of the 21st ACM-SIAM 
Symp. On Discrete Algorithms, 2010, pp. 938-948. 

[5] M. Goodrich, N. Sitchinava, and Q. Zhang, "Sorting, search- 
ing, and simulation in the MapReduce framework," in Proc. 
of the 22nd International Symp. on Algorithms and Compu- 
tation, 2011, to appear. See also CoRR abs/1004.470. 

[6] J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and 
Z. Svitkina, "On distributing symmetric streaming computa- 
tions," ACM Transactions on Algorithms, vol. 6, no. 4, 2010. 



[7] L. Valiant, "A bridging model for parallel computation," 
Communications of the ACM, vol. 33, no. 8, pp. 103-111, 
Aug. 1990. 

[8] G. Bilardi and A. Pietracaprina, "Theoretical models of com- 
putation," in Encyclopedia of Parallel Computing, D. Padua, 
Ed. Springer, 2011, to appear. 

[9] F. Chierichetti, R. Kumar, and A. Tomkins, "Max-cover in 
Map-Reduce," in Proc. of the 19th World Wide Web Confer- 
ence, 2010, pp. 231-240. 

[10] C. Tsourakakis, U. Kang, G. Miller, and C. Faloutsos, 
"DOULION: counting triangles in massive graphs with a 
coin," in Proc. of the 15th ACM SIGKDD Intl. Conference 
on Knowledge Discovery and Data Mining, 2009, pp. 837- 
849. 

[11] K. Mulmuley, U. V. Vazirani, and V. V. Vazirani, "Matching 
is as easy as matrix inversion," Combinatorica, vol. 7, no. 1, 
pp. 105-113, 1987. 

[12] J. JSJd, An Introduction to Parallel Algorithms. Addison 
Wesley Longman Publishing Co., Inc., 1992. 

[13] J. Gilbert, V. Shah, and S. Reinhardt, "A unified framework 
for numerical and combinatorial computing," Computing in 
Science Engineering, vol. 10, no. 2, pp. 20-25, 2008. 

[14] R. Yuster and U. Zwick, "Detecting short directed cycles 
using rectangular matrix multiplication and dynamic pro- 
gramming," in Proc. of 15th ACM-SIAM Symp. On Discrete 
Algorithms, 2004, pp. 254-260. 

[15] G. Penn, "Efficient transitive closure of sparse matrices over 
closed semirings," Theoretical Computer Science, vol. 354, 
no. 1, pp. 72-81, 2006. 

[16] D. Irony, S. Toledo, and A. Tiskin, "Communication lower 
bounds for distributed-memory matrix multiplication," Jour- 
nal of Parallel and Distributed Computing, vol. 64, no. 9, pp. 
1017-1026, 2004. 

[17] W. F. McColl and A. Tiskin, "Memory-efficient matrix mul- 
tiplication in the BSP model," Algorithmica, vol. 24, no. 3, 
pp. 287-297, 1999. 

[18] M. Middendorf, H. Schmeck, H. Schrder, and G. Turner, 
"Multiplication of matrices with different sparseness prop- 
erties on dynamically reconfigurable meshes," VLSI Design, 
no. 9, pp. 69-81, 1999. 

[19] G. Manzini, "Sparse matrix computations on the hypercube 
and related networks," Journal of Parallel and Distributed 
Computing, vol. 21, no. 2, pp. 169-183, 1994. 



[21] A. Buluf and J. R. Gilbert, "Challenges and advances in 
parallel sparse matrix-matrix multiplication," in Proc. of 37th 
International Conference on Parallel Processing, 2008, pp. 
503-510, see also CoRR abs/1006.2183. 

[22] J.-W. Hong and H. T. Kung, "I/O complexity: The red-blue 
pebble game," in Proceedings of the 13th ACM Symp. on 
Theory of Computing, 1981, pp. 326-333. 

[23] L. R. Kerr, "The effect of algebraic structure on the compu- 
tational complexity of matrix multiplication," Ph.D. disserta- 
tion, Cornell University, 1970. 

[24] F. G. Gustavson, "Two fast algorithms for sparse matrices: 
Multiplication and permuted transposition," ACM Transac- 
tions on Mathematical Software, vol. 4, no. 3, pp. 250-269, 
1978. 

[25] R. Yuster and U. Zwick, "Fast sparse matrix multiplication," 
ACM Transactions on Algorithms, vol. 1, no. 1, pp. 2-13, 
2005. 

[26] G. Greiner and R. Jacob, "The I/O complexity of sparse 
matrix dense matrix multiplication," in Proc. of 9th Latin 
American Theoretical Informatics, 2010, vol. 6034, pp. 143- 
156. 

[27] M. Goodrich, "Simulating parallel algorithms in the MapRe- 
duce framework with applications to parallel computational 
geometry," 2010, coRR abs/1004.470. 

[28] M. T. Goodrich, "Communication-efficient parallel sorting," 
SIAM Journal on Computing, vol. 29, no. 2, pp. 416-432, 
1999. 

[29] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and 
L. Trevisan, "Counting distinct elements in a data stream," 
in Proc. of the 6th Int. Workshop on Randomization and 
Approximation Techniques, 2002, pp. 1-10. 

[30] R. Amossen, A. Campagna, and R. Pagh, "Better size esti- 
mation for sparse matrix products," in Approximation, Ran- 
domization, and Combinatorial Optimization. Algorithms and 
Techniques, ser. Lecture Notes in Computer Science, 2010, 
vol. 6302, pp. 406^19. 

[31] M. Mitzenmacher and E. Upfal, Probability and Computing: 
Randomized Algorithms and Probabilistic Analysis. Cam- 
bridge University Press, Cambridge MA, 2005. 

[32] V. Pan and J. Reif, "Efficient parallel solution of linear 
systems," in Proc. of the 17th ACM Symp. on Theory of 
Computing, 1985, pp. 143-152. 



[20] C. P. Kruskal, L. Rudolph, and M. Snir, "Techniques for par- 
allel manipulation of sparse matrices," Theoretical Computer 
Science, vol. 64, no. 2, pp. 135-157, 1989. 



